Reduce memory usage by GroupReadsByUmi in a corner case #774

Merged
merged 6 commits into from
Feb 20, 2022

Conversation

tfenne
Member

@tfenne tfenne commented Feb 17, 2022

No description provided.

@tfenne tfenne self-assigned this Feb 17, 2022
@codecov-commenter

codecov-commenter commented Feb 17, 2022

Codecov Report

Merging #774 (771c87a) into master (9054893) will decrease coverage by 0.07%.
The diff coverage is 80.55%.

❗ Current head 771c87a differs from pull request most recent head 84ef127. Consider uploading reports for the commit 84ef127 to get more accurate results

@@            Coverage Diff             @@
##           master     #774      +/-   ##
==========================================
- Coverage   95.57%   95.50%   -0.08%     
==========================================
  Files         119      119              
  Lines        6805     6830      +25     
  Branches      476      450      -26     
==========================================
+ Hits         6504     6523      +19     
- Misses        301      307       +6     
Flag Coverage Δ
unittests 95.50% <80.55%> (-0.08%) ⬇️

Impacted Files Coverage Δ
...cala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala 94.41% <78.78%> (-2.46%) ⬇️
...in/scala/com/fulcrumgenomics/umi/CorrectUmis.scala 98.70% <100.00%> (+0.03%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +189 to +190
logger.warning(s"Read (${rec.name}) detected with unexpected length UMI(s): ${sequences.mkString(" ")}.")
logger.warning(s"Expected UMI length: ${umiLength}")
Member Author

Apologies for the unrelated change here. I had a stupid typo that took me far too long to figure out: I used -u instead of -U, and CorrectUmis happily decided my filename was the sole UMI sequence to correct to. If the message here had told me my expected UMI length was 30+, that would have helped!

@tfenne tfenne marked this pull request as ready for review February 18, 2022 00:00
@tfenne tfenne requested a review from nh13 February 18, 2022 00:00
Member

@nh13 nh13 left a comment

I think there's a subtle change to the behavior which we could think of as an improvement, but it does change the behavior, so we should discuss.

iterator.hasNext &&
firstEnds == ReadInfo(iterator.head.r1.get) &&
// This last condition only works because we put a canonicalized UMI into rec(assignTag) if canTakeNextGroupByUmi
(!canTakeNextGroupByUmi || firstUmi == iterator.head.r1.get.apply[String](this.assignTag))
Member

So this does change behavior in a subtle way. Suppose we have three templates that are the same in everything but the assign tag. Let's say their UMIs are AAAA, GGGGG, and GGGGT, with --min-umi-length=3.

Previously, all templates would be read into memory, and truncateUmis would truncate to the length of the shortest UMI observed in the group of templates, in this case 4bp long due to the AAAA (not 3 as per the command line!). So the three templates would have UMIs AAAA->AAAA, GGGGG->GGGG, and GGGGT->GGGG. So we'd assign two unique molecules (AAAA and GGGG, with the last molecule containing the last two reads).

In the new implementation, we truncate the raw UMI bases based on --min-umi-length=3 to set MI for sorting. So we'd truncate to length 3 for sorting: AAAA->AAA, GGGGG->GGG, and GGGGT->GGG. When we read back in after sorting, we read in the first template by itself (it is the only one with MI:AAA), so no truncation is applied and it stays the same (AAAA). We then read in all templates whose MI is GGG, which gives the second two templates. Now we go back to the raw UMIs to find the length of the shorter UMI of the two. Both are 5bp long, so we do not truncate and keep them the same (GGGGG and GGGGT). But now these UMIs differ, so we get three molecules!

One could argue that the new implementation is an improvement, but it does change behavior subtly.
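The difference can be reproduced with a minimal sketch, assuming exact UMI matching (edits = 0) and that every UMI is at least --min-umi-length long. The object and method names below are hypothetical and are not taken from GroupReadsByUmi; the code only models the two truncation orders described above.

```scala
// Hypothetical model of the two truncation orders; not the fgbio implementation.
object UmiTruncationSketch {
  /** Old order: truncate every UMI to the shortest length seen in the whole
    * position group, then group identical truncated UMIs. */
  def groupAllAtOnce(umis: Seq[String]): Map[String, Seq[String]] = {
    val minLen = umis.map(_.length).min
    umis.groupBy(_.take(minLen))
  }

  /** New order: first bucket by the sort key (the first `minUmiLength` bases),
    * then truncate within each bucket to that bucket's shortest raw UMI. */
  def groupBySortKeyFirst(umis: Seq[String], minUmiLength: Int): Map[String, Seq[String]] = {
    umis.groupBy(_.take(minUmiLength)).values.flatMap { bucket =>
      val minLen = bucket.map(_.length).min
      bucket.groupBy(_.take(minLen))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val umis = Seq("AAAA", "GGGGG", "GGGGT")
    // Two molecules: AAAA and GGGG
    println(groupAllAtOnce(umis).keySet)
    // Three molecules: AAAA, GGGGG and GGGGT
    println(groupBySortKeyFirst(umis, minUmiLength = 3).keySet)
  }
}
```

With the example UMIs, the old order yields two molecules and the new order yields three, matching the walkthrough above.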

Member Author

Hrm, that is an interesting point. So basically, if you have variable-length single UMIs and have edits = 0, the behavior will be subtly different. My instinct is to call this an improvement and move on, but I don't have a great sense of who uses the variable-length support (or on what kind of data), so I'm not 100% sure.

@tfenne tfenne merged commit 7176170 into master Feb 20, 2022
@tfenne tfenne deleted the tf_group_reads_by_umi_speed_tweaks branch February 20, 2022 12:52